{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing and Analyzing Jigsaw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import re\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous section, we explored how to generate topics from a textual dataset using LDA. But how can this be used as an application? \n",
    "\n",
    "Therefore, in this section, we will look into the possible ways to read the topics as well as understand how it can be used."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now import the preloaded data of the LDA result that was achieved in the previous section. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://raw.githubusercontent.com/dudaspm/LDA_Bias_Data/main/topics.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>Topic 0 words</th>\n",
       "      <th>Topic 0 weights</th>\n",
       "      <th>Topic 1 words</th>\n",
       "      <th>Topic 1 weights</th>\n",
       "      <th>Topic 2 words</th>\n",
       "      <th>Topic 2 weights</th>\n",
       "      <th>Topic 3 words</th>\n",
       "      <th>Topic 3 weights</th>\n",
       "      <th>Topic 4 words</th>\n",
       "      <th>...</th>\n",
       "      <th>Topic 5 words</th>\n",
       "      <th>Topic 5 weights</th>\n",
       "      <th>Topic 6 words</th>\n",
       "      <th>Topic 6 weights</th>\n",
       "      <th>Topic 7 words</th>\n",
       "      <th>Topic 7 weights</th>\n",
       "      <th>Topic 8 words</th>\n",
       "      <th>Topic 8 weights</th>\n",
       "      <th>Topic 9 words</th>\n",
       "      <th>Topic 9 weights</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>trump</td>\n",
       "      <td>3452.3</td>\n",
       "      <td>mental</td>\n",
       "      <td>3351.9</td>\n",
       "      <td>canada</td>\n",
       "      <td>591.5</td>\n",
       "      <td>mental</td>\n",
       "      <td>1186.5</td>\n",
       "      <td>gun</td>\n",
       "      <td>...</td>\n",
       "      <td>school</td>\n",
       "      <td>840.5</td>\n",
       "      <td>mental</td>\n",
       "      <td>1058.1</td>\n",
       "      <td>white</td>\n",
       "      <td>1220.1</td>\n",
       "      <td>mental</td>\n",
       "      <td>1836.1</td>\n",
       "      <td>god</td>\n",
       "      <td>954.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>presid</td>\n",
       "      <td>1031.5</td>\n",
       "      <td>ill</td>\n",
       "      <td>1993.1</td>\n",
       "      <td>muslim</td>\n",
       "      <td>582.0</td>\n",
       "      <td>peopl</td>\n",
       "      <td>708.3</td>\n",
       "      <td>mental</td>\n",
       "      <td>...</td>\n",
       "      <td>kid</td>\n",
       "      <td>723.0</td>\n",
       "      <td>comment</td>\n",
       "      <td>848.3</td>\n",
       "      <td>peopl</td>\n",
       "      <td>1076.2</td>\n",
       "      <td>peopl</td>\n",
       "      <td>1793.0</td>\n",
       "      <td>one</td>\n",
       "      <td>934.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>vote</td>\n",
       "      <td>813.8</td>\n",
       "      <td>health</td>\n",
       "      <td>1213.7</td>\n",
       "      <td>countri</td>\n",
       "      <td>539.3</td>\n",
       "      <td>drug</td>\n",
       "      <td>555.8</td>\n",
       "      <td>peopl</td>\n",
       "      <td>...</td>\n",
       "      <td>year</td>\n",
       "      <td>590.5</td>\n",
       "      <td>like</td>\n",
       "      <td>678.6</td>\n",
       "      <td>black</td>\n",
       "      <td>651.0</td>\n",
       "      <td>health</td>\n",
       "      <td>1464.6</td>\n",
       "      <td>women</td>\n",
       "      <td>905.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>like</td>\n",
       "      <td>780.9</td>\n",
       "      <td>medic</td>\n",
       "      <td>706.8</td>\n",
       "      <td>us</td>\n",
       "      <td>519.8</td>\n",
       "      <td>ill</td>\n",
       "      <td>538.9</td>\n",
       "      <td>law</td>\n",
       "      <td>...</td>\n",
       "      <td>go</td>\n",
       "      <td>514.7</td>\n",
       "      <td>would</td>\n",
       "      <td>668.2</td>\n",
       "      <td>disord</td>\n",
       "      <td>537.1</td>\n",
       "      <td>homeless</td>\n",
       "      <td>1367.5</td>\n",
       "      <td>life</td>\n",
       "      <td>830.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>elect</td>\n",
       "      <td>579.5</td>\n",
       "      <td>http</td>\n",
       "      <td>630.5</td>\n",
       "      <td>world</td>\n",
       "      <td>490.3</td>\n",
       "      <td>health</td>\n",
       "      <td>497.7</td>\n",
       "      <td>kill</td>\n",
       "      <td>...</td>\n",
       "      <td>time</td>\n",
       "      <td>507.9</td>\n",
       "      <td>think</td>\n",
       "      <td>650.4</td>\n",
       "      <td>person</td>\n",
       "      <td>529.5</td>\n",
       "      <td>care</td>\n",
       "      <td>1296.8</td>\n",
       "      <td>peopl</td>\n",
       "      <td>798.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0 Topic 0 words  Topic 0 weights Topic 1 words  Topic 1 weights  \\\n",
       "0           0         trump           3452.3        mental           3351.9   \n",
       "1           1        presid           1031.5           ill           1993.1   \n",
       "2           2          vote            813.8        health           1213.7   \n",
       "3           3          like            780.9         medic            706.8   \n",
       "4           4         elect            579.5          http            630.5   \n",
       "\n",
       "  Topic 2 words  Topic 2 weights Topic 3 words  Topic 3 weights Topic 4 words  \\\n",
       "0        canada            591.5        mental           1186.5           gun   \n",
       "1        muslim            582.0         peopl            708.3        mental   \n",
       "2       countri            539.3          drug            555.8         peopl   \n",
       "3            us            519.8           ill            538.9           law   \n",
       "4         world            490.3        health            497.7          kill   \n",
       "\n",
       "   ...  Topic 5 words Topic 5 weights  Topic 6 words Topic 6 weights  \\\n",
       "0  ...         school           840.5         mental          1058.1   \n",
       "1  ...            kid           723.0        comment           848.3   \n",
       "2  ...           year           590.5           like           678.6   \n",
       "3  ...             go           514.7          would           668.2   \n",
       "4  ...           time           507.9          think           650.4   \n",
       "\n",
       "   Topic 7 words Topic 7 weights  Topic 8 words Topic 8 weights  \\\n",
       "0          white          1220.1         mental          1836.1   \n",
       "1          peopl          1076.2          peopl          1793.0   \n",
       "2          black           651.0         health          1464.6   \n",
       "3         disord           537.1       homeless          1367.5   \n",
       "4         person           529.5           care          1296.8   \n",
       "\n",
       "   Topic 9 words Topic 9 weights  \n",
       "0            god           954.9  \n",
       "1            one           934.0  \n",
       "2          women           905.2  \n",
       "3           life           830.1  \n",
       "4          peopl           798.2  \n",
       "\n",
       "[5 rows x 21 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will visualize these results to understand what major themes are present in them.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<iframe src='https://flo.uri.sh/story/941631/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941631/?utm_source=embed&utm_campaign=story/941631' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "\n",
    "<iframe src='https://flo.uri.sh/story/941631/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941631/?utm_source=embed&utm_campaign=story/941631' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An Overview of the analysis \n",
    "From the above visualization, an anomaly that we come across is that the dataset we are examining is supposed to be related to people with physical, mental, and learning disabilities. But unfortunately, based on the topics that were extracted, we notice just a small subset of words that are related to this topic. \n",
    "Topic 2 has words that address themes related to what we were expecting the dataset to have. But the major theme that was noticed in the Top 5 topics are main terms that are political. \n",
    "(The Top 10 topics show themes related to Religion as well, which is quite interesting.)\n",
    "LDA hence helped in understanding what the conversations the dataset consisted. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the word collection, we also notice that there were certain words such as \\'kill' that can be categorized as \\'Toxic'\\. To analyze this more, we can classify each word because it can be categorized wi by an NLP classifier. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To demonstrate an example of a toxic analysis framework, the below code shows the working of the Unitary library in python. {cite}`Detoxify`\n",
    "\n",
    "This library provides a toxicity score (from a scale of 0 to 1) for the sentence that is passed through it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "headers = {\"Authorization\": f\"Bearer api_ZtUEFtMRVhSLdyTNrRAmpxXgMAxZJpKLQb\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get access to this software, you will need to get an API KEY at https://huggingface.co/unitary/toxic-bert\n",
    "Here is an example of what this would look like.\n",
    "```python\n",
    "headers = {\"Authorization\": f\"Bearer api_XXXXXXXXXXXXXXXXXXXXXXXXXXX\"}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "API_URL = \"https://api-inference.huggingface.co/models/unitary/toxic-bert\"\n",
    "\n",
    "def query(payload):\n",
    "    response = requests.post(API_URL, headers=headers, json=payload)\n",
    "    return response.json()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[{'label': 'toxic', 'score': 0.9272779822349548},\n",
       "  {'label': 'severe_toxic', 'score': 0.00169223896227777},\n",
       "  {'label': 'obscene', 'score': 0.03694247826933861},\n",
       "  {'label': 'threat', 'score': 0.0017220545560121536},\n",
       "  {'label': 'insult', 'score': 0.02829463966190815},\n",
       "  {'label': 'identity_hate', 'score': 0.004070617724210024}]]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query({\"inputs\": \"addict\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can input words or sentences in \\<insert word here>, in the code, to look at the results that are generated through this.\n",
    "\n",
    "This example can provide an idea as to how ML can be used for toxicity analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[{'label': 'toxic', 'score': 0.5101907849311829},\n",
       "  {'label': 'severe_toxic', 'score': 0.07646821439266205},\n",
       "  {'label': 'obscene', 'score': 0.12113521993160248},\n",
       "  {'label': 'threat', 'score': 0.07763686031103134},\n",
       "  {'label': 'insult', 'score': 0.11923719942569733},\n",
       "  {'label': 'identity_hate', 'score': 0.09533172845840454}]]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query({\"inputs\": \"<insert word here>\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<iframe src='https://flo.uri.sh/story/941681/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941681/?utm_source=embed&utm_campaign=story/941681' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "\n",
    "<iframe src='https://flo.uri.sh/story/941681/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941681/?utm_source=embed&utm_campaign=story/941681' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The Bias\n",
    "The visualization shows how contextually toxic words are derived as important words within various topics related to this dataset. These toxic words can lead to any Natural Language Processing kernel learning this dataset to provide skewed analysis for the population in consideration, i.e., people with mental, physical, and learning disabilities. This can lead to very discriminatory classifications. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### An Example\n",
    "To illustrate the impact better, we will be taking the most associated words to the word 'mental' from the results. Below is a network graph that shows the commonly associated words. It is seen that words such as 'Kill' and 'Gun' appear with the closest association. This can lead to the machine contextualizing the word 'mental' to be associated with such words. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe src='https://flo.uri.sh/visualisation/6867000/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6867000/?utm_source=embed&utm_campaign=visualisation/6867000' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<iframe src='https://flo.uri.sh/visualisation/6867000/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6867000/?utm_source=embed&utm_campaign=visualisation/6867000' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is hence essential to be aware of the dataset that is being used to analyze a specific population. With LDA, we were able to understand that this dataset cannot be used as a good representation of the disabled community. To bring about a movement of unbiased AI, we need to perform such preliminary analysis and more not to cause unintended discrimination. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Dashboard\n",
    "\n",
    "Below is the complete data visualization dashboard of the topic analysis. Feel feel to experiment and compare various labels to your liking. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<iframe src='https://flo.uri.sh/visualisation/6856937/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6856937/?utm_source=embed&utm_campaign=visualisation/6856937' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "\n",
    "<iframe src='https://flo.uri.sh/visualisation/6856937/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6856937/?utm_source=embed&utm_campaign=visualisation/6856937' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Thank you!\n",
    "\n",
    "We thank you for your time! "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}